{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Natural Language Processing in Finance"
   ]
  },
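  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Natural language processing (NLP) is a major branch of artificial intelligence. Put simply, it is about getting machines to understand human text and speech; machine translation, speech recognition, semantic understanding, question answering, and knowledge graphs all fall within its scope.\n",
    "\n",
    "Since the earliest days of computing, people have worked on getting machines to understand human language. With steady progress in artificial intelligence, computer science, information engineering, statistics, and even linguistics, NLP now has many commercial applications, such as machine translation (Google Translate, Youdao Translate), knowledge graphs (search engines, with Google as the leading example), and question answering (Apple's Siri, Amazon's Alexa, and various chatbots).\n",
    "\n",
    "NLP in finance, however, is still at an exploratory stage. Finance is a highly specialized field: many words take on special meanings in a financial context, each sub-problem has to be understood in its own way, and results are evaluated differently than in other fields. In public opinion analysis, for example, finance expects the output to have some predictive power over where the market is heading.\n",
    "\n",
    "Financial NLP therefore needs purpose-built training datasets. Current NLP methods generally depend on large amounts of data, so the shortage of datasets is one of the biggest obstacles NLP faces in finance, and it is a direct consequence of how specialized and deep the field is."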
   ]
  },
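  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What practical problems can a strong NLP system solve for financial institutions?\n",
    "\n",
    "Web-wide public opinion monitoring, industry chain analysis, and having machines read large volumes of news on behalf of financial institutions.\n",
    "\n",
    "For example, a commercial bank wants to use more comprehensive data for corporate credit risk management so that it can detect a company's potential risks early. The conventional approach bases the assessment on the company's published annual reports combined with a loan officer's on-site investigation. But companies tend to disclose their own risks late and public information has limited coverage, so what the bank sees is often only the tip of the iceberg and its means of judging risk are very narrow. This is where NLP and AI can help.\n",
    "\n",
    "NLP can mine multi-dimensional relationships from this information, assess how companies relate to one another, and present those links visually as a knowledge graph. Warning signals can be set up in advance: as soon as any entity in a company's relationship network changes, the impact on the whole network can be estimated quickly from the relationship weights.\n",
    "\n",
    "![](http://5b0988e595225.cdn.sohucs.com/images/20180817/1be14f4f13914a80bd3c29ed7e74b4c4.png)"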
   ]
  },
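  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Conceptual Framework of Financial Semantic Application Scenarios\n",
    "\n",
    "1. Question answering and semantic search\n",
    "\n",
    "Question answering and semantic search are key NLP technologies. They let users ask questions in natural language, analyze the semantics in depth to better understand the user's intent, and retrieve information from a knowledge base quickly and accurately. In the user interface this can take the form of a Q&A bot (question answering) or a search engine (semantic search). A question answering system generally has three stages: question understanding, information retrieval, and answer generation. Compared with text-based question answering, knowledge-graph-based question answering is better suited to the practical needs of financial business.\n",
    "\n",
    "2. News and public opinion analysis\n",
    "\n",
    "Financial information is abundant: company news (announcements, major events, financial condition), financial product data (stocks, securities), macroeconomic data (inflation, unemployment), policies and regulations (macro policy, tax policy), social media commentary, and more.\n",
    "\n",
    "3. Financial forecasting and analysis\n",
    "\n",
    "Semantics-based financial forecasting uses the information contained in financial text to predict movements in financial markets. It combines AI techniques such as NLP with quantitative finance.\n",
    "\n",
    "4. Document information extraction\n",
    "\n",
    "Information extraction is a foundational NLP technique: it underpins further data mining and analysis, as well as knowledge extraction for knowledge graphs. Methods include slot filling with rule-based templates and approaches based on machine learning or deep learning. By what is extracted, it can be divided into entity extraction, attribute extraction, relation extraction, rule extraction, event extraction, and so on.\n",
    "\n",
    "5. Automatic document generation\n",
    "\n",
    "Automatic document generation means producing financial documents automatically from given data sources. Commonly generated documents include disclosure announcements (bond ratings, share transfer prospectuses) and research reports of various kinds."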
   ]
  },
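  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Word segmentation with LAC\n",
    "\n",
    "LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's NLP department. It performs Chinese word segmentation, part-of-speech tagging, and named entity recognition.\n",
    "\n",
    "The code is compatible with Python 2 and 3.\n",
    "\n",
    "- Automatic installation: ``pip install lac``\n",
    "- Install from the Baidu mirror for faster downloads: ``pip install lac -i https://mirror.baidu.com/pypi/simple``\n"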
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['LAC', '是', '个', '优秀', '的', '分词', '工具'], ['百度', '是', '一家', '高科技', '公司']]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from LAC import LAC\n",
    "\n",
    "# Load the segmentation model\n",
    "lac = LAC(mode='seg')\n",
    "\n",
    "# Single input: one sentence\n",
    "text = u\"LAC是个优秀的分词工具\"\n",
    "seg_result = lac.run(text)\n",
    "\n",
    "# Batch input: a list of sentences, which gives higher average throughput\n",
    "texts = [u\"LAC是个优秀的分词工具\", u\"百度是一家高科技公司\"]\n",
    "seg_result = lac.run(texts)\n",
    "\n",
    "seg_result"
   ]
  },
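  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Word segmentation with jieba\n",
    "\n",
    "jieba is a Chinese word segmentation package for Python; see https://github.com/fxsjy/jieba  \n",
    "\n",
    "It supports word segmentation, part-of-speech tagging, and keyword extraction for Chinese text, and it also supports custom dictionaries.\n",
    "\n",
    "It can be installed directly with pip:\n",
    "``pip install jieba``"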
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学\n"
     ]
    }
   ],
   "source": [
    "import jieba\n",
    "text = '我来到北京清华大学'\n",
    "\n",
    "# Full mode: list every word the dictionary can find in the text\n",
    "seg_list = jieba.cut(text, cut_all=True)\n",
    "print(\"Full Mode: \" + \"/ \".join(seg_list))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[pair('我', 'r'), pair('来到', 'v'), pair('北京', 'ns'), pair('清华大学', 'nt')]\n"
     ]
    }
   ],
   "source": [
    "import jieba.posseg as posseg\n",
    "seg = posseg.lcut(text)\n",
    "print(seg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Keyword extraction\n",
    "\n",
    "![](https://upload-images.jianshu.io/upload_images/11482169-b1f9fd2c1fc2bffe.png)"
   ]
  },
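  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Keyword extraction with TF-IDF\n",
    "\n",
    "TF-IDF (term frequency-inverse document frequency) is a statistical measure of how important a word is to one document within a collection or corpus. The idea can be summarized as:\n",
    "\n",
    "> The more often a word appears in a document and the fewer documents in the collection contain it, the more important that word is to the document.\n",
    "\n",
    "Formula: TF-IDF = TF * IDF, where:\n",
    "\n",
    "- TF (term frequency): the number of times a given word appears in the document\n",
    "\n",
    "- IDF (inverse document frequency): the fewer documents contain a term, the larger its IDF and the better the term discriminates between documents"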
   ]
  },
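  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One common way to write this out, given as a sketch and not necessarily the exact variant a particular library uses: let $n_{t,d}$ be the number of times term $t$ appears in document $d$, $N$ the number of documents in the corpus, and $\\mathrm{df}(t)$ the number of documents that contain $t$. These symbols are introduced here for illustration only.\n",
    "\n",
    "$$\\mathrm{tf}(t,d) = \\frac{n_{t,d}}{\\sum_{k} n_{k,d}}, \\qquad \\mathrm{idf}(t) = \\ln\\frac{N}{\\mathrm{df}(t)}, \\qquad \\mathrm{tfidf}(t,d) = \\mathrm{tf}(t,d) \\cdot \\mathrm{idf}(t)$$\n",
    "\n",
    "As a worked example under these definitions, a word that appears 3 times in a 100-word document has $\\mathrm{tf} = 0.03$. If it occurs in 10 of 1,000 documents, then $\\mathrm{idf} = \\ln(100) \\approx 4.61$, giving a score of about $0.14$. A word that occurs in every document has $\\mathrm{idf} = \\ln(1) = 0$, so its score is 0 no matter how often it appears.\n",
    "\n",
    "Note that `extract_tags` in the next cell receives only a single text, so the IDF values cannot come from that text; jieba takes them from an IDF table bundled with the library."
   ]
  },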
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TF-IDF\n",
      "反转 0.41\n",
      "A股 0.41\n",
      "50% 0.41\n",
      "起点 0.39\n",
      "季度 0.33\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import jieba.analyse as analyse\n",
    "\n",
    "# Read the sample document to analyze\n",
    "with open('sample.txt', encoding='utf-8') as f:\n",
    "    data = f.read()\n",
    "\n",
    "# Print the five keywords with the highest TF-IDF weights\n",
    "print('TF-IDF')\n",
    "for keyword, weight in analyse.extract_tags(data, withWeight=True, topK=5):\n",
    "    print('{} {}'.format(keyword, np.round(weight, 2)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Keyword extraction with TextRank\n",
    "\n",
    "TextRank is another keyword extraction algorithm, built on the well-known PageRank. For details, see the paper [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)\n"
   ]
  },
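  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the scoring rule from the paper linked above: for keyword extraction, each candidate word is a vertex $V_i$, and two words are joined by an edge of weight $w_{ij}$ when they co-occur within a fixed-size window (the weight can simply be 1 or a co-occurrence count, depending on the implementation). Scores are updated iteratively until they converge:\n",
    "\n",
    "$$WS(V_i) = (1 - d) + d \\sum_{V_j \\in In(V_i)} \\frac{w_{ji}}{\\sum_{V_k \\in Out(V_j)} w_{jk}} WS(V_j)$$\n",
    "\n",
    "Here $d$ is the damping factor inherited from PageRank (the paper sets it to 0.85), $In(V_i)$ is the set of vertices pointing to $V_i$, and $Out(V_j)$ is the set of vertices that $V_j$ points to. For an undirected co-occurrence graph the two sets coincide. How a particular library builds the graph (window size, part-of-speech filtering) is an implementation detail not covered here."
   ]
  },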
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TextRank\n",
      "银行 1.0\n",
      "起点 0.92\n",
      "反转 0.85\n",
      "处于 0.81\n",
      "季度 0.49\n"
     ]
    }
   ],
   "source": [
    "# Print the five keywords with the highest TextRank scores\n",
    "print('TextRank')\n",
    "for keyword, weight in analyse.textrank(data, withWeight=True, topK=5):\n",
    "    print('{} {}'.format(keyword, np.round(weight, 2)))"
   ]
  },
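  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SnowNLP\n",
    "\n",
    "SnowNLP is a Python library for conveniently processing Chinese text. It was inspired by TextBlob: since most NLP libraries target English, its author wrote one for Chinese. Unlike TextBlob, it does not use NLTK; all algorithms are implemented from scratch, and it ships with some pre-trained dictionaries. `SnowNLP(text).sentiments` returns a value between 0 and 1, the estimated probability that the text is positive. According to the project README, the bundled sentiment model was trained mainly on shopping reviews, so scores on financial text should be read with caution.\n",
    "\n",
    "Git repo: [link](https://github.com/isnowfy/snownlp)"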
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "中石化真心棒，我赚了好多钱 0.7961687156385213\n",
      "这个股票简直烂到爆 0.024879725715451606\n"
     ]
    }
   ],
   "source": [
    "from snownlp import SnowNLP\n",
    "\n",
    "text1 = '中石化真心棒，我赚了好多钱'\n",
    "text2 = '这个股票简直烂到爆'\n",
    "\n",
    "# sentiments is the estimated probability (0 to 1) that the text is positive\n",
    "print(text1, SnowNLP(text1).sentiments)\n",
    "print(text2, SnowNLP(text2).sentiments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "生物 1.0\n",
      "提出 0.82\n",
      "投资人 0.78\n",
      "公司 0.59\n",
      "管理层 0.57\n",
      "相信 0.52\n",
      "感情 0.52\n",
      "应该 0.44\n",
      "产品 0.4\n",
      "上市 0.39\n",
      "Sentiment score (above 0.6 is positive, below 0.2 is negative): 0.09\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "text = '12月5日14时30分，在“沃森生物转让泽润生物股权”的电话会上，沃森生物董事长李云春遭投资人猛烈炮轰。除了质疑贱卖子公司，投资人还提出公司应该停牌，甚至提出向监管层举报。“你们把我们这些炒股票的当傻子吗？你看看万泰生物值多少钱，你竟然卖的那么低！你们这些人不相信因果报应吗？”“你们是主动卖泽润的还是泽润管理层逼迫你们卖的？”“泽润产品马上上市了，可以自己造血了，为什么要卖？”对此，公司管理层的回答则是——“我们主动卖的，我们是专业的，我们是对沃森倾注了感情的，请相信我们”。'\n",
    "\n",
    "# Top 10 TextRank keywords (analyse is jieba.analyse, imported in an earlier cell)\n",
    "for keyword, weight in analyse.textrank(text, withWeight=True, topK=10):\n",
    "    print('{} {}'.format(keyword, np.round(weight, 2)))\n",
    "\n",
    "# Overall sentiment of the passage\n",
    "s = SnowNLP(text)\n",
    "print(\"Sentiment score (above 0.6 is positive, below 0.2 is negative):\", np.round(s.sentiments, 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>news</th>\n",
       "      <th>sentiments</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2020年12月7日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.713557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>“十四五”规划推出在即：国产替代迎加速 智能机器人站上风口</td>\n",
       "      <td>0.292492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>商务部支持集装箱制造企业扩大产能 相关公司望获益(附股)</td>\n",
       "      <td>0.871751</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>又一行业涨价了：这次轮到PCB 景气度或延续至明年一季度(附股)</td>\n",
       "      <td>0.993395</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2020年12月4日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.678260</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2020年12月3日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.659060</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2020年12月2日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.598748</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>12月金股出炉：大金融获强推 顺周期空间仍有余量（名单）</td>\n",
       "      <td>0.644080</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>天空才是极限：比特币创历史新高 概念股名单奉上</td>\n",
       "      <td>0.259305</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>11月券商金股收益放榜：最牛股暴涨近60% 12月名单奉上</td>\n",
       "      <td>0.007122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2020年12月1日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.594757</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>碳排放减少迫在眉睫 关注赛道稀缺龙头（附股）</td>\n",
       "      <td>0.347483</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>后疫情时期，高瓴资本赛道配置：继续重仓电商 加仓云计算领域</td>\n",
       "      <td>0.999990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2020年11月30日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.612599</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2020年11月27日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.582831</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>世界5G大会举行：行业迎多重利好 投资机会有哪些？</td>\n",
       "      <td>0.990744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2020年11月26日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.570923</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>稀土价格持续上涨 相关公司或受关注（附股）</td>\n",
       "      <td>0.185043</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>成功挤掉苹果：小米手机全球销量挺进前三 背后有这些供应商</td>\n",
       "      <td>0.868777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2020年11月25日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.634730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>特钢长牛时代来临？相关概念股持续走强</td>\n",
       "      <td>0.991833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>2020年11月24日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.633205</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>年底冲关又看券商？11只券商股被坚定看好</td>\n",
       "      <td>0.037789</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>小电车蕴含大市场 机构称未来三年单车产量复合增长率将超30%(股)</td>\n",
       "      <td>0.987369</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>2020年11月23日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.624516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</td>\n",
       "      <td>0.515381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>拉尼娜来袭 一文看清相关行业投资机会（附股）</td>\n",
       "      <td>0.974010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？</td>\n",
       "      <td>0.061959</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>2020年11月20日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.581728</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）</td>\n",
       "      <td>0.701960</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大</td>\n",
       "      <td>0.931788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>国常会再提促进家电消费：家电股迎政策红利 两条主线布局</td>\n",
       "      <td>0.809421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了</td>\n",
       "      <td>0.258146</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>2020年11月19日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.543657</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)</td>\n",
       "      <td>0.828341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益</td>\n",
       "      <td>0.890235</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>2020年11月18日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.734892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>能源工业云网正式发布 赋能能源产业链(附股)</td>\n",
       "      <td>0.043493</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)</td>\n",
       "      <td>0.930371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口</td>\n",
       "      <td>0.520880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>2020年11月17日涨停板早知道：七大利好有望发酵</td>\n",
       "      <td>0.628731</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)</td>\n",
       "      <td>0.130005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)</td>\n",
       "      <td>0.668574</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)</td>\n",
       "      <td>0.016453</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>全球最大自贸协定达成：零关税产品超90% 概念股名单来了</td>\n",
       "      <td>0.062543</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 news  sentiments\n",
       "0           2020年12月7日涨停板早知道：七大利好有望发酵    0.713557\n",
       "1       “十四五”规划推出在即：国产替代迎加速 智能机器人站上风口    0.292492\n",
       "2        商务部支持集装箱制造企业扩大产能 相关公司望获益(附股)    0.871751\n",
       "3    又一行业涨价了：这次轮到PCB 景气度或延续至明年一季度(附股)    0.993395\n",
       "4           2020年12月4日涨停板早知道：七大利好有望发酵    0.678260\n",
       "5           2020年12月3日涨停板早知道：七大利好有望发酵    0.659060\n",
       "6           2020年12月2日涨停板早知道：七大利好有望发酵    0.598748\n",
       "7        12月金股出炉：大金融获强推 顺周期空间仍有余量（名单）    0.644080\n",
       "8             天空才是极限：比特币创历史新高 概念股名单奉上    0.259305\n",
       "9       11月券商金股收益放榜：最牛股暴涨近60% 12月名单奉上    0.007122\n",
       "10          2020年12月1日涨停板早知道：七大利好有望发酵    0.594757\n",
       "11             碳排放减少迫在眉睫 关注赛道稀缺龙头（附股）    0.347483\n",
       "12      后疫情时期，高瓴资本赛道配置：继续重仓电商 加仓云计算领域    0.999990\n",
       "13         2020年11月30日涨停板早知道：七大利好有望发酵    0.612599\n",
       "14         2020年11月27日涨停板早知道：七大利好有望发酵    0.582831\n",
       "15          世界5G大会举行：行业迎多重利好 投资机会有哪些？    0.990744\n",
       "16         2020年11月26日涨停板早知道：七大利好有望发酵    0.570923\n",
       "17              稀土价格持续上涨 相关公司或受关注（附股）    0.185043\n",
       "18       成功挤掉苹果：小米手机全球销量挺进前三 背后有这些供应商    0.868777\n",
       "19         2020年11月25日涨停板早知道：七大利好有望发酵    0.634730\n",
       "20                 特钢长牛时代来临？相关概念股持续走强    0.991833\n",
       "21         2020年11月24日涨停板早知道：七大利好有望发酵    0.633205\n",
       "22               年底冲关又看券商？11只券商股被坚定看好    0.037789\n",
       "23  小电车蕴含大市场 机构称未来三年单车产量复合增长率将超30%(股)    0.987369\n",
       "24         2020年11月23日涨停板早知道：七大利好有望发酵    0.624516\n",
       "25      *ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产    0.515381\n",
       "26             拉尼娜来袭 一文看清相关行业投资机会（附股）    0.974010\n",
       "27         暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？    0.061959\n",
       "28         2020年11月20日涨停板早知道：七大利好有望发酵    0.581728\n",
       "29    三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）    0.701960\n",
       "30        军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大    0.931788\n",
       "31        国常会再提促进家电消费：家电股迎政策红利 两条主线布局    0.809421\n",
       "32        涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了    0.258146\n",
       "33         2020年11月19日涨停板早知道：七大利好有望发酵    0.543657\n",
       "34      前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)    0.828341\n",
       "35      手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益    0.890235\n",
       "36         2020年11月18日涨停板早知道：七大利好有望发酵    0.734892\n",
       "37             能源工业云网正式发布 赋能能源产业链(附股)    0.043493\n",
       "38    10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)    0.930371\n",
       "39      全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口    0.520880\n",
       "40         2020年11月17日涨停板早知道：七大利好有望发酵    0.628731\n",
       "41      有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)    0.130005\n",
       "42      医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)    0.668574\n",
       "43        疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)    0.016453\n",
       "44       全球最大自贸协定达成：零关税产品超90% 概念股名单来了    0.062543"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def request_url(url):\n",
    "    \"\"\"Fetch a page and return its HTML decoded as UTF-8.\"\"\"\n",
    "    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'\n",
    "    headers = {'User-Agent': user_agent}\n",
    "\n",
    "    res = requests.get(url, headers=headers, timeout=10)\n",
    "    res.raise_for_status()  # fail loudly on HTTP errors\n",
    "    res.encoding = 'utf-8'\n",
    "    return res.text\n",
    "\n",
    "url = 'https://finance.sina.com.cn/roll/index.d.html?cid=56588&page=1'\n",
    "soup = BeautifulSoup(request_url(url), 'lxml')\n",
    "\n",
    "# Collect the text of links that open in a new tab; the first two matches are dropped\n",
    "info = [inf.text for inf in soup.find_all('a', target='_blank')][2:]\n",
    "\n",
    "# Score each headline with SnowNLP (imported in an earlier cell)\n",
    "senti = [SnowNLP(headline).sentiments for headline in info]\n",
    "df = pd.DataFrame({'news': info, 'sentiments': senti})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sentiment score (above 0.6 is positive, below 0.2 is negative): 0.58\n"
     ]
    }
   ],
   "source": [
    "# Average sentiment across all scraped headlines\n",
    "print(\"Sentiment score (above 0.6 is positive, below 0.2 is negative):\", np.round(np.mean(senti), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Analyze Sina news to determine the direction of public sentiment, and evaluate how accurate the prediction is.\n"
   ]
  },
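  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from snownlp import SnowNLP\n",
    "import pandas as pd\n",
    "\n",
    "# 'ansi' is a Windows-only codec alias; 'gbk' assumes the CSV was saved on a Chinese-locale Windows machine\n",
    "sina_news = pd.read_csv('sina_fin_news.csv', encoding='gbk')\n",
    "# Keep only the date part of each timestamp\n",
    "sina_news['date'] = [x.split(' ')[0] for x in sina_news['date']]\n",
    "# Score every news item with SnowNLP\n",
    "sina_news['status'] = [SnowNLP(text).sentiments for text in sina_news['news']]"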
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "s_news = sina_news.groupby('date')['status'].transform('mean')\n",
    "sina_news['sentiment'] = ['Pos' if s_value > 0.6 else ('Neg' if s_value < 0.2 else 'None') for s_value in s_news]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>news</th>\n",
       "      <th>date</th>\n",
       "      <th>status</th>\n",
       "      <th>sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>99</td>\n",
       "      <td>英国4月零售物价指数年率+2.5%，预期+2.6%，前值+2.5%；月率+0.4%，预期+0...</td>\n",
       "      <td>2014/5/20</td>\n",
       "      <td>0.069470</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>98</td>\n",
       "      <td>英国4月核心CPI年率+2.0%，创2013年9月以来最大升幅，预期+1.8%，前值+1.6...</td>\n",
       "      <td>2014/5/20</td>\n",
       "      <td>0.795471</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>97</td>\n",
       "      <td>英国3月DCLG房价指数年率+8.0%，预期+9.6%，前值+9.1%。</td>\n",
       "      <td>2014/5/20</td>\n",
       "      <td>0.541969</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>96</td>\n",
       "      <td>英国4月生产者输入物价指数月率-1.1%，预期-0.2%，前值-0.4%；年率-5.5%，预...</td>\n",
       "      <td>2014/5/20</td>\n",
       "      <td>0.057408</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>95</td>\n",
       "      <td>据世界黄金协会最新黄金需求趋势报告，一季度金条金币需求同比重挫39%至283吨，为四年来最低水平。</td>\n",
       "      <td>2014/5/20</td>\n",
       "      <td>0.999945</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>4</td>\n",
       "      <td>在圣彼得堡国际经济论坛上，道达尔CEO马哲睿(christophe de Margerie)...</td>\n",
       "      <td>2014/5/25</td>\n",
       "      <td>0.823601</td>\n",
       "      <td>Pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>3</td>\n",
       "      <td>雷石东(Sumner Redstone)过去约一周中出售2.36亿美元维亚康姆(VIA)和C...</td>\n",
       "      <td>2014/5/25</td>\n",
       "      <td>0.675677</td>\n",
       "      <td>Pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>2</td>\n",
       "      <td>苹果谋求禁止销售三星9款较老机型。此前苹果(AAPL)在美赢得一桩诉讼，裁决认定三星侵犯苹果...</td>\n",
       "      <td>2014/5/25</td>\n",
       "      <td>0.999985</td>\n",
       "      <td>Pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>1</td>\n",
       "      <td>郑商所期货与衍生品部总监左宏亮25日在“第九届中国期货暨衍生品市场论坛”上表示，从目前的郑商...</td>\n",
       "      <td>2014/5/25</td>\n",
       "      <td>0.892710</td>\n",
       "      <td>Pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>0</td>\n",
       "      <td>中金所期货小组成员刘炜亮25日在“第九届中国期货暨衍生品市场论坛”上表示，做市商不是期权推出...</td>\n",
       "      <td>2014/5/25</td>\n",
       "      <td>0.606726</td>\n",
       "      <td>Pos</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     id                                               news       date  \\\n",
       "0    99  英国4月零售物价指数年率+2.5%，预期+2.6%，前值+2.5%；月率+0.4%，预期+0...  2014/5/20   \n",
       "1    98  英国4月核心CPI年率+2.0%，创2013年9月以来最大升幅，预期+1.8%，前值+1.6...  2014/5/20   \n",
       "2    97               英国3月DCLG房价指数年率+8.0%，预期+9.6%，前值+9.1%。  2014/5/20   \n",
       "3    96  英国4月生产者输入物价指数月率-1.1%，预期-0.2%，前值-0.4%；年率-5.5%，预...  2014/5/20   \n",
       "4    95  据世界黄金协会最新黄金需求趋势报告，一季度金条金币需求同比重挫39%至283吨，为四年来最低水平。  2014/5/20   \n",
       "..   ..                                                ...        ...   \n",
       "995   4  在圣彼得堡国际经济论坛上，道达尔CEO马哲睿(christophe de Margerie)...  2014/5/25   \n",
       "996   3  雷石东(Sumner Redstone)过去约一周中出售2.36亿美元维亚康姆(VIA)和C...  2014/5/25   \n",
       "997   2  苹果谋求禁止销售三星9款较老机型。此前苹果(AAPL)在美赢得一桩诉讼，裁决认定三星侵犯苹果...  2014/5/25   \n",
       "998   1  郑商所期货与衍生品部总监左宏亮25日在“第九届中国期货暨衍生品市场论坛”上表示，从目前的郑商...  2014/5/25   \n",
       "999   0  中金所期货小组成员刘炜亮25日在“第九届中国期货暨衍生品市场论坛”上表示，做市商不是期权推出...  2014/5/25   \n",
       "\n",
       "       status sentiment  \n",
       "0    0.069470      None  \n",
       "1    0.795471      None  \n",
       "2    0.541969      None  \n",
       "3    0.057408      None  \n",
       "4    0.999945      None  \n",
       "..        ...       ...  \n",
       "995  0.823601       Pos  \n",
       "996  0.675677       Pos  \n",
       "997  0.999985       Pos  \n",
       "998  0.892710       Pos  \n",
       "999  0.606726       Pos  \n",
       "\n",
       "[1000 rows x 5 columns]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sina_news"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
